Fine-mapping studies distinguish genetic risks for childhood- and adult-onset asthma in the HLA region

Background Genome-wide association studies of asthma have revealed robust associations with variation across the human leukocyte antigen (HLA) complex with independent associations in the HLA class I and class II regions for both childhood-onset asthma (COA) and adult-onset asthma (AOA). However, the specific variants and genes contributing to risk are unknown. Methods We used Bayesian approaches to perform genetic fine-mapping for COA and AOA (n=9432 and 21,556, respectively; n=318,167 shared controls) in White British individuals from the UK Biobank and to perform expression quantitative trait locus (eQTL) fine-mapping in immune (lymphoblastoid cell lines, n=398; peripheral blood mononuclear cells, n=132) and airway (nasal epithelial cells, n=188) cells from ethnically diverse individuals. We also examined putatively causal protein coding variation from protein crystal structures and conducted replication studies in independent multi-ethnic cohorts from the UK Biobank (COA n=1686; AOA n=3666; controls n=56,063). Results Genetic fine-mapping revealed both shared and distinct causal variation between COA and AOA in the class I region but only distinct causal variation in the class II region. Both gene expression levels and amino acid variation contributed to risk. Our results from eQTL fine-mapping and amino acid visualization suggested that the HLA-DQA1*03:01 allele and variation associated with expression of the nonclassical HLA-DQA2 and HLA-DQB2 genes accounted entirely for the most significant association with AOA in GWAS. Our studies also suggested a potentially prominent role for HLA-C protein coding variation in the class I region in COA. We replicated putatively causal variant associations in a multi-ethnic cohort. Conclusions We highlight roles for both gene expression and protein coding variation in asthma risk and identified putatively causal variation and genes in the HLA region. A convergence of genomic, transcriptional, and protein coding evidence implicates the HLA-DQA2 and HLA-DQB2 genes and HLA-DQA1*03:01 allele in AOA. Supplementary Information The online version contains supplementary material available at 10.1186/s13073-022-01058-2.


Fine Mapping the HLA Region
We used Sum of Single Effects (SuSiE(1)) (susieR R package version 0.9.0) to fine map the HLA loci for childhood-onset asthma (COA) and adult-onset asthma (AOA). The susieR R package does not currently allow for the inclusion of covariates, so sex and the first 10 ancestral principal components (PCs) were regressed out of the genotype matrix and phenotype vector using linear regression; we used the residuals of the genotype matrix and phenotype vector as inputs to SuSIE. The SuSiE method is based on a linear regression, and so when applied to binary data, it will estimate and test for effects in terms of risk differences, rather than the more conventional odds ratio (OR). Applying linear methods to binary data is justified here because the estimated ORs were all small (<1.3), the allele frequencies were not too extreme, and the sample size (here, limited by the smaller number of cases) was large (2)(3)(4). See Pirinen et al.
Section 3 for detailed discussion of applying linear methods to binary data and the relationship between estimated risk differences and ORs. We assumed at most L = 10 causal variants and set susieR to estimate the residual and prior variances. We retained only level-95% credible sets (coverage = 0.95). We took the additional step of discarding credible sets in which the "purity" (smallest absolute correlation among all pairs of variants within the credible set) was less than 0.50. We only considered credible sets that contained at least one variant reaching genome-wide significance to avoid any possible artifacts.

HLA Fine-Mapping Simulations
Because the HLA region is extraordinarily complex, we assessed the performance of SuSiE in this region by simulation. Existing genotype and covariate data were used to leverage the true correlation structure in the class I and class II regions to simulate both binary (e.g. case/control status) and quantitative (e.g. gene expression) outcomes.
To simulate binary (e.g. case/control status) outcomes, we used the genotype matrix X (HLA class I or class II loci defined by Pividori et al. (5)) and covariate matrix Z from the UK Biobank and set the individual-level log-odds of asthma to be for individual i, SNPs j, covariates k, fixed effect vectors and , and a fixed intercept . We used the true matrix of covariates and covariate effects estimated from a logistic regression, separately for COA and AOA simulations. was set to 0 for all non-causal variants. For causal variants, was set using effect sizes similar to what was found in the Pividori COA and AOA GWASs(5). We randomly selected 0-3 variants from a random uniform distribution to be causal (with non-zero effects) for both the class I and class II regions using both COA and AOA effect sizes. We simulated case/control status for each individual as ∼ ( ) independently and regressed out the covariates in Z from X and Y.
For quantitative outcomes (e.g. gene expression), we used X and Z in the HLA class I and HLA class II regions from the nasal epithelial cell (NEC) dataset from URECA described below. We set the individual-level mean to be = ∑ + ∑ + and we used the true matrix of covariates and effects estimated from a linear regression.
was set to 0 for all non-causal variants, and causal were set using effect sizes similar to what was found from the NEC eQTL studies (described below). We similarly randomly selected 0-3 causal signals in both the class I and class II regions and set the individual level response to be ∼ ( , 2 ). We similarly regressed out the covariates in Z from X and Y. These simulations were used to test how well SuSiE recovers the causal effects over each simulation.

Lymphoblastoid Cell Lines (LCLs)
We examined RNA-seq data previously collected from LCLs from 398 Hutterites (6). The Standard RNA-seq pipelines that map reads to a reference genome can provide biased expression estimates at the highly polymorphic HLA loci due to the potentially large number of differences between the sequence of an individual's HLA type and the reference sequence used for mapping (8,9). Expression estimates can be improved by mapping RNA-seq reads to the sequences for each individual's known HLA type (8). For the polymorphic HLA genes, we aligned RNA-seq reads to reference sequences from the IMGT database(10) for each individual's known HLA type, removing duplicate reads with WASP (11). Sequencing reads were mapped and quantified using STAR/2.6.1(12) for other genes. Samples with >7M uniquely mapped reads underwent trimmed means of M-value (TMM) normalization and voom transformation (13). We corrected for extraction date and sequencing batch with limma (14).
To perform eQTL mapping, associations between SNPs and expression of genes in the HLA class I and class II regions were performed with Genome-wide Efficient Mixed Model Association (GEMMA)(15) using a kinship matrix to correct for relatedness between Hutterite individuals. We used a linear mixed model (LMM) including age and sex as covariates and considered all variants within 1 Mb of the transcription start site (TSS) of each expressed gene.

Peripheral Blood Mononuclear Cells (PBMCs)
We examined unstimulated PBMC RNA-seq data from 132 (78 males, 54 females) African-American children from the URban Environment and Childhood Asthma (URECA) birth cohort who were 2 years old at the time of sample collection (16,17 (23) to account for unwanted variation as covariates in the analysis.

Nasal Epithelial Cells (NECs)
We examined NEC RNA-seq data from 188 (92 females, 96 males) African-American children (age 11 at time of sample collection) from the URECA cohort(24). As described above for the PBMCs, we used HLA-LA to infer HLA types from whole-genome sequences, mapped RNA-seq reads as described above, and used QTLtools to perform eQTL mapping, using sex, the first three ancestral PCs, collection site, epithelial cell proportion, sequencing batch, and seven latent factors(23) as covariates in the analysis.

Fig. S1. Ancestry PCs for the Replication and Discovery Cohorts.
Ancestry PC1, PC2, and PC3 are shown for the discovery cohort ("White British (discovery)") and the replication cohort, with the colors corresponding to self-reported ancestry. Odds ratios and 95% confidence intervals are shown for the HLA alleles that were significantly associated (p<5.0x10 -8 ) with either childhood-onset asthma (COA, blue) and/or adult-onset asthma (AOA, red). The results for all alleles for the six HLA loci are shown in Table S2. Odds ratios and 95% confidence intervals are shown for HLA amino acid polymorphisms that were significantly associated (p<5x10 -8 ) with either childhood-onset asthma (COA, blue) and/or adult-onset asthma (AOA, red).      (6)) and the PBMCs (16,17) and NECs(24) from URECA (URban Environment and Childhood Asthma). Each row is a result for each variant reflecting each credible set (CS). The original p-value and odds ratio (OR) for the risk allele are shown, then the p-value and OR when allergy was included as a covariate in the regression ("Allergy Cov"), when all individuals with allergy were excluded ("No allergy"), and the interaction between the variant and allergy status are shown. For each variant, the p-value and ORs for the risk allele are shown for the original analysis ("Original"), in just female participants, and in just males. The p-value for the interaction between sex and the variant is also shown.  (Table S6)   P-value, odds ratio (OR), and 95% confidence interval (CI) shown for the original, marginal association and for the association when conditioning on either the class I or class II signals. Number of individuals in each self-reported ethnic group are shown for childhood-onset asthma (COA), adult-onset asthma (AOA), and the non-asthmatic controls (Ctl).